Cosine Similarity Measure and Genetic Algorithm for extracting main content from web documents

Authors

  • Digvijay B. Gautam
  • Pradnya V. Kulkarni
Abstract

As the volume of information on the web grows, web mining has become a necessity, and research on it has attracted considerable interest from both industry and academia. Mining and predicting users' web browsing behavior and deducing the actual content of a web document are among the active research topics. Information on the web is noisy: apart from useful content, pages contain unwanted material, such as copyright notices and navigation bars, that is not part of the main content. This noise seriously harms web data mining and hence needs to be eliminated. This paper studies possible similarity criteria based on cosine similarity for deducing which parts of the content are more important than others. Under the vector space model, information retrieval is based on the similarity measured between a query and the documents. Documents with high similarity to the query are judged more relevant to it and should be retrieved first. Under genetic algorithms, each query is represented by a chromosome. These chromosomes are fed through the genetic operators (selection, crossover, and mutation) until an optimized query chromosome for document retrieval is obtained. Our test results show that information retrieval with a crossover probability of 0.8 and a mutation probability of 0.01 gives the highest precision, while a crossover probability of 0.8 and a mutation probability of 0.3 gives the highest recall.

Keywords: Cosine Similarity; Genetic Algorithm; Fitness Function; Web Content Mining.
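
The retrieval loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the chromosome encoding, the fitness function, or the selection scheme. The sketch assumes real-valued term-weight chromosomes, a fitness equal to the mean cosine similarity between the query chromosome and a set of documents already marked relevant, tournament selection, single-point crossover, and elitism. The names `fitness`, `tournament_select`, and `evolve_query` are hypothetical; only the crossover probability of 0.8 and the mutation probability of 0.01 come from the abstract.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def fitness(chromosome, relevant_docs):
    # Assumed fitness: mean cosine similarity to the relevant documents.
    total = sum(cosine_similarity(chromosome, d) for d in relevant_docs)
    return total / len(relevant_docs)


def tournament_select(population, scores, rng, k=2):
    # Assumed selection scheme: pick k individuals, keep the fittest.
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def crossover(p1, p2, rng, p_cross):
    # Single-point crossover, applied with probability p_cross.
    if len(p1) > 1 and rng.random() < p_cross:
        point = rng.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1[:], p2[:]


def mutate(chromosome, rng, p_mut):
    # Each gene is replaced by a fresh random weight with probability p_mut.
    return [rng.random() if rng.random() < p_mut else g for g in chromosome]


def evolve_query(relevant_docs, pop_size=20, generations=50,
                 p_cross=0.8, p_mut=0.01, seed=0):
    rng = random.Random(seed)
    n_terms = len(relevant_docs[0])
    population = [[rng.random() for _ in range(n_terms)]
                  for _ in range(pop_size)]
    best = max(population, key=lambda c: fitness(c, relevant_docs))
    for _ in range(generations):
        scores = [fitness(c, relevant_docs) for c in population]
        next_pop = [best[:]]  # elitism: carry the best query over unchanged
        while len(next_pop) < pop_size:
            a = tournament_select(population, scores, rng)
            b = tournament_select(population, scores, rng)
            for child in crossover(a, b, rng, p_cross):
                next_pop.append(mutate(child, rng, p_mut))
        population = next_pop[:pop_size]
        best = max(population, key=lambda c: fitness(c, relevant_docs))
    return best


if __name__ == "__main__":
    # Toy term-weight vectors for two documents assumed to be relevant.
    docs = [[1.0, 0.0, 1.0, 0.0], [0.8, 0.1, 0.9, 0.0]]
    query = evolve_query(docs)
    print(query, fitness(query, docs))
```

Because the best chromosome is copied unchanged into each new generation, the best fitness never decreases from one generation to the next. Precision and recall, as reported in the abstract, would be measured separately by ranking a document collection against the evolved query.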

Similar resources

Content based Sentence Ordering using Spanning Tree Algorithm for Improved Multi Document Summarization

Because the required information is available on the web as multiple documents, summarizing these documents and ordering the sentences of the summary efficiently has become a relevant task in data mining. We present a novel sentence ordering method based on maximum cost spanning tree algorithm to improve the readability and cohesion of the summary obtained by ext...


Web-Document Retrieval by Genetic Learning of Importance Factors for HTML Tags

In contrast to conventional documents, a Web document consists of a number of tags which provide hints on the structure of the documents. In this paper, we propose a Web-document retrieval method using the characteristics of HTML tags. This method learns the importance of tags from a training text set. We use a genetic algorithm for learning the importance weights. We also present a modified sim...


Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...


Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...


A Comparative Study on Approaches of Vector Space Model in Information Retrieval

The vector space model is one of the classical and widely applied information retrieval models for ranking web pages based on similarity values. The retrieval operations use a cosine similarity function to compute the similarity values between a given query and the set of documents retrieved, and then rank the documents according to relevance. In this paper, we are presenting different a...


Publication date: 2014